In previous notebooks, we have assigned each cell to its condition (0h, 8h, etc.), we have filtered low-quality cells, and we have normalized gene counts to correct for biases such as differences in library size. The result of that is a SingleCellExperiment object that we saved as .RDS and that will be the starting point of this analysis.
Here, we aim to cluster cells to identify each cell type. Hence, we are going to use Seurat, a CRAN package that has become a swiss-knife for scRNA-seq analysis. As described in Kiselev et al, Seurat uses a graph-based clustering algorithm that is scalable to datasets with thousands of cells. Therefore, we will leverage such scalability to cluster >10,000 cells contained in the SCE object.
library(SingleCellExperiment)
library(scater)
library(AnnotationDbi)
library(org.Hs.eg.db)
library(Seurat)
library(cowplot)
library(ggpubr)
library(purrr)
library(tidyverse)
source("bin/utils.R")
Seurat uses its own single-cell data container, a so-called Seurat object. Hence, we first need to convert the SCE to this new data structure:
date <- Sys.Date()
# Load SingleCellExperiment
sce <- readRDS("results/R_objects/10X_SingleCellExperiment_filt&norm.RDS")
# To increase interpretability downstream, change rownames from ensembl to gene
# symbol
rownames(sce) %>%
table() %>%
sort(decreasing = TRUE) %>%
head(10)
## .
## PNRC2 SRSF10 7SK A1BG A2M-AS1 AAAS AACS AAED1 AAGAB AAK1
## 2 2 1 1 1 1 1 1 1 1
ind <- match(c("PNRC2", "SRSF10"), rownames(sce))
rownames(sce)[ind] <- c("PNRC2.1", "SRSF10.1")
# Convert SCE to Seurat
seurat <- as.Seurat(sce)
As we have two donors (male/female), we will annotate them separately.
To cluster our cells, we need to overcome 3 challenges:
A first approach to tackle these challenges consists of finding the most variable genes in the dataset (feature extraction). That is, to find the subset of genes that drive most of the variability in the expression matrix. Seurat calculates the average expression and dispersion for each gene. Then, it divides genes into bins based on its average, and for each bin computes a z-score per gene. Those genes with a z-score above a certain cutoff are categorized as highly variable. The binning step is vital, since genes with more UMI tend to have more dispersion.
seurat_l <- SplitObject(seurat, split.by = "donor")
seurat_l <- purrr::map(seurat_l, FindVariableFeatures)
purrr::map(seurat_l, ~ length(VariableFeatures(.x)))
## $male
## [1] 2000
##
## $female
## [1] 2000
purrr::map(seurat_l, VariableFeaturePlot)
## $male
##
## $female
As we can see, we reduce the number of dimensions from >10,000 genes to 2000 HVG.
An important pre-processing step in any cluster analysis is to scale the data, as otherwise variables with a higher mean will have a higher weight in the distance metric. We regress out the “batch” variable:
seurat_l <- purrr::map(seurat_l, ScaleData, vars.to.regress = "batch")
An additional challenge in our cluster analysis is that scRNAs-seq is very noisy (very susceptible to technical artifacts), and very sparse (contains drop-outs). Thus, differences in single genes may not be accurate to identify cell types. To that end, we can perform PCA, in which each PC can be conceived as a ‘metagene’ that includes information across a correlated gene set. Furthermore, we will reduce the dimensionality even more.
seurat_l <- purrr::map(seurat_l, RunPCA)
purrr::map(seurat_l, VizDimLoadings, dims = 1:2)
## $male
##
## $female
purrr::map(seurat_l, PCHeatmap, dims = 1, balanced = TRUE)
## $male
## NULL
##
## $female
## NULL
Seurat uses the Louvain algorithm to cluster cells:
seurat_l <- purrr::map(seurat_l, FindNeighbors, dims = 1:15)
seurat_l <- purrr::map(seurat_l, FindClusters, resolution = 0.05)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 7471
## Number of edges: 281204
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9739
## Number of communities: 4
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2893
## Number of edges: 110323
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9781
## Number of communities: 4
## Elapsed time: 0 seconds
We can visualize the former clusters with a t-Stochastic Neighbor Embedding (tSNE) and a Uniform Manifold Approximation and Projection (UMAP), which allow to depict more structure in the data than PCA:
seurat_l <- purrr::map(seurat_l, RunTSNE, dims = 1:15)
seurat_l <- purrr::map(seurat_l, RunUMAP, dims = 1:15)
seurat_l <- purrr::map(seurat_l, function(s) {
s@reductions$PCA <- NULL
s@reductions$TSNE <- NULL
s@reductions$UMAP <- NULL
s
})
purrr::map(seurat_l, DimPlot, reduction = "tsne")
## $male
##
## $female
purrr::map(seurat_l, DimPlot, reduction = "umap")
## $male
##
## $female
As we can see, there are 5 major clusters. Interestingly:
purrr::map(seurat_l, function(s) {
Idents(s) <- "batch"
DimPlot(s)
})
## $male
##
## $female
Regressing out the batch effect with the ScaleData function from above removed the majority of the batch effect.
Let us find the markers of each of the clusters above. In other words, let us find which genes are exclusively expressed in each cluster and will help us identify the cell types in our data set:
markers <- purrr::map(seurat_l, FindAllMarkers, only.pos = TRUE)
DT::datatable(markers$male)
DT::datatable(markers$female)
saveRDS(markers, file = "results/R_objects/pbmc_markers_list.rds")
marker_selection <- purrr::map2(seurat_l, markers, function(seurat, df) {
num_k <- length(unique(seurat$seurat_clusters))
selection <- unlist(purrr::map(0:(num_k-1), ~df[df$cluster == .x, "gene"][0:8]))
selection
})
marker_selection <- purrr::map(marker_selection, ~ c("IL7R", .x))
heatmaps_markers <- purrr::map2(seurat_l, marker_selection, ~DoHeatmap(.x, features = .y) + NoLegend())
heatmaps_markers
## $male
##
## $female
Based on the previously found markers, we can annotate each cluster to known cell types:
| Cluster ID | Markers | Cell Type |
|---|---|---|
| 0 | IL7R | T cells |
| 1 | GNLY, NKG7 | Natural Killer (NK) |
| 2 | LYZ, S100A8 | Monocytes |
| 3 | MS4A1, CD79B | B cells |
seurat_l$male$cell_type <- factor(seurat_l$male$seurat_clusters)
levels(seurat_l$male$cell_type) <- c("T", "NK", "Monocyte", "B")
| Cluster ID | Markers | Cell Type |
|---|---|---|
| 0 | IL7R | T cells |
| 1 | GNLY, NKG7 | Natural Killer (NK) |
| 2 | LYZ, S100A8 | Monocytes |
| 3 | MS4A1, CD79B | B cells |
seurat_l$female$cell_type <- factor(seurat_l$female$seurat_clusters)
levels(seurat_l$female$cell_type) <- c("T", "NK", "Monocyte", "B")
We can visualize the annotation with all cells pooled together:
cell_type_df <- purrr::map(seurat_l, ~ .x@meta.data[, "cell_type", drop = FALSE])
cell_type_df <- cell_type_df %>%
purrr::map(rownames_to_column, var = "barcode") %>%
bind_rows(.id = "donor")
rownames(cell_type_df) <- cell_type_df$barcode
cell_type_df <- cell_type_df[colnames(seurat), ]
seurat$cell_type <- cell_type_df$cell_type
seurat <- pre_process_seurat(seurat, vars_to_regress = "batch")
Idents(seurat) <- "cell_type"
umap_annot <- DimPlot(seurat, reduction = "umap", label = TRUE) + NoLegend()
umap_markers <- FeaturePlot(seurat, features = c("IL7R", "GNLY", "LYZ", "MS4A1"), reduction = "umap")
saveRDS(umap_annot, file = "results/R_objects/ggplots/umap_annotation_pbmc.rds")
saveRDS(umap_markers, file = "results/R_objects/ggplots/umap_markers_pbmc.rds")
umap_annot
umap_markers
Let us summarise the number of cells per cell type statified by donor and temperature
qc_cell_type_pbmc <- seurat@meta.data
qc_cell_type_pbmc$temperature <- case_when(
qc_cell_type_pbmc$condition == "0h" ~ "0h",
qc_cell_type_pbmc$condition %in% c("2h", "8h", "24h_RT", "48h_RT") ~ "21ºC",
qc_cell_type_pbmc$condition %in% c("24h_4C", "48h_4C") ~ "4ºC"
)
qc_cell_type_pbmc$time <- qc_cell_type_pbmc$condition %>%
str_remove("_4C") %>%
str_remove("_RT")
levels(qc_cell_type_pbmc$cell_type) <- c("T-cell", "NK", "Monocyte", "B-cell")
qc_cell_type_pbmc <- qc_cell_type_pbmc %>%
mutate(time = factor(time, levels = c("0h", "2h", "8h", "24h", "48h")),
temperature = factor(temperature, levels = c("0h", "21ºC", "4ºC")),
donor = factor(donor, levels = c("male", "female"))) %>%
group_by(time, donor, temperature, cell_type) %>%
summarise(num_cells = n()) %>%
ungroup()
colnames(qc_cell_type_pbmc)[4] <- "annotation"
qc_cell_type_pbmc <- add_column(
qc_cell_type_pbmc,
experiment = rep("PBMC", nrow(qc_cell_type_pbmc)),
.after = 0
)
DT::datatable(qc_cell_type_pbmc)
saveRDS(qc_cell_type_pbmc, "results/R_objects/qc_summary_table_cell_type_pbmc.rds")
# saveRDS(seurat, "results/R_objects/10X_pbmc_Seurat_clustered.RDS")
# saveRDS(seurat_l, "results/R_objects/10X_pbmc_Seurat_donors_list_clustered.RDS")
# Recode variables
# seurat <- colData(sce2)[, c("batch", "donor", "ident", "condition")]
# conds <- c("0h", "2h", "8h", "24h_RT", "48h_RT", "24h_4C", "48h_4C")
# sce2$condition <- factor(sce2$condition, levels = conds)
#
# levels(sce2$condition) <- conds %>%
# str_remove("RT") %>%
# str_remove("_")
# sce2$temperature <- case_when(
# sce2$condition == "0h" ~ "gold",
# sce2$condition %in% c("2h", "8h", "24h", "24hBioabank", "48h") ~ "room temperature",
# sce2$condition %in% c("24h4C", "48h4C") ~ "4ºC"
# )
# sce2$condition <- str_remove(as.character(sce2$condition), "4C")
# colnames(colData(sce2)) <- c("batch", "sex", "cell_type", "time", "temperature")
#
# # Recode rowData variables
# rowData(sce2) <- rowData(sce)
#
# # Save as RDS
# saveRDS(sce2, "results/R_objects/10X_SingleCellExperiment_clustered.RDS")
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.0 stringr_1.4.0 dplyr_0.8.4 readr_1.3.1 tidyr_1.0.2 tibble_2.1.3 tidyverse_1.3.0 purrr_0.3.3 ggpubr_0.2.5 magrittr_1.5 cowplot_1.0.0 Seurat_3.1.4 org.Hs.eg.db_3.10.0 AnnotationDbi_1.48.0 scater_1.14.6 ggplot2_3.3.0 SingleCellExperiment_1.8.0 SummarizedExperiment_1.16.1 DelayedArray_0.12.2 BiocParallel_1.20.1 matrixStats_0.55.0 Biobase_2.46.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.0 IRanges_2.20.2 S4Vectors_0.24.3 BiocGenerics_0.32.0 BiocStyle_2.14.4
##
## loaded via a namespace (and not attached):
## [1] reticulate_1.14 tidyselect_1.0.0 RSQLite_2.2.0 htmlwidgets_1.5.1 grid_3.6.1 Rtsne_0.15 munsell_0.5.0 codetools_0.2-16 mutoss_0.1-12 ica_1.0-2 DT_0.12 future_1.16.0 withr_2.1.2 colorspace_1.4-1 knitr_1.28 rstudioapi_0.11 ROCR_1.0-7 ggsignif_0.6.0 gbRd_0.4-11 listenv_0.8.0 Rdpack_0.11-1 labeling_0.3 GenomeInfoDbData_1.2.2 mnormt_1.5-6 bit64_0.9-7 farver_2.0.3 vctrs_0.2.3 generics_0.0.2 TH.data_1.0-10 xfun_0.12 R6_2.4.1 ggbeeswarm_0.6.0 rsvd_1.0.3 bitops_1.0-6 assertthat_0.2.1 promises_1.1.0 scales_1.1.0 multcomp_1.4-12 beeswarm_0.2.3 gtable_0.3.0 npsurv_0.4-0 globals_0.12.5 sandwich_2.5-1 rlang_0.4.5 splines_3.6.1 lazyeval_0.2.2 broom_0.5.5
## [48] BiocManager_1.30.10 yaml_2.2.1 reshape2_1.4.3 modelr_0.1.6 crosstalk_1.0.0 backports_1.1.5 httpuv_1.5.2 tools_3.6.1 bookdown_0.18 gplots_3.0.3 RColorBrewer_1.1-2 ggridges_0.5.2 TFisher_0.2.0 Rcpp_1.0.3 plyr_1.8.6 zlibbioc_1.32.0 RCurl_1.98-1.1 pbapply_1.4-2 viridis_0.5.1 zoo_1.8-7 haven_2.2.0 ggrepel_0.8.1 cluster_2.1.0 fs_1.3.2 data.table_1.12.8 RSpectra_0.16-0 magick_2.3 lmtest_0.9-37 reprex_0.3.0 RANN_2.6.1 mvtnorm_1.1-0 fitdistrplus_1.0-14 xtable_1.8-4 mime_0.9 hms_0.5.3 patchwork_1.0.0 lsei_1.2-0 evaluate_0.14 readxl_1.3.1 gridExtra_2.3 compiler_3.6.1 KernSmooth_2.23-16 crayon_1.3.4 htmltools_0.4.0 later_1.0.0 RcppParallel_4.4.4 lubridate_1.7.4
## [95] DBI_1.1.0 dbplyr_1.4.2 MASS_7.3-51.5 Matrix_1.2-18 cli_2.0.2 gdata_2.18.0 metap_1.3 igraph_1.2.4.2 pkgconfig_2.0.3 sn_1.5-5 numDeriv_2016.8-1.1 plotly_4.9.2 xml2_1.2.2 vipor_0.4.5 multtest_2.42.0 XVector_0.26.0 bibtex_0.4.2.2 rvest_0.3.5 digest_0.6.25 sctransform_0.2.1 RcppAnnoy_0.0.15 tsne_0.1-3 rmarkdown_2.1 cellranger_1.1.0 leiden_0.3.3 uwot_0.1.5 DelayedMatrixStats_1.8.0 shiny_1.4.0 gtools_3.8.1 lifecycle_0.1.0 nlme_3.1-145 jsonlite_1.6.1 BiocNeighbors_1.4.2 viridisLite_0.3.0 fansi_0.4.1 pillar_1.4.3 lattice_0.20-40 fastmap_1.0.1 httr_1.4.1 plotrix_3.7-7 survival_3.1-8 glue_1.3.1 png_0.1-7 bit_1.1-15.2 stringi_1.4.6 blob_1.2.1 BiocSingular_1.2.2
## [142] caTools_1.18.0 memoise_1.1.0 irlba_2.3.3 future.apply_1.4.0 ape_5.3